This lab introduces you to some of the basic functions in R for plotting and analyzing univariate time series data. Many of the things you learn here will be relevant when we start examining multivariate time series as well. We will begin with the creation and plotting of time series objects in R, and then moves on to decomposition, differencing, and correlation (e.g., ACF, PACF) before ending with fitting and simulation of ARMA models.
We’ll use two publicly available environmental datasets in this lab.
The main one is a time series of the atmospheric concentration of
CO2 collected at the Mauna Loa Observatory in Hawai’i
(ML_CO2.csv). The second is Northern Hemisphere land and
ocean temperature anomalies from NOAA. (NH_temp). You can
download both of them from GitHub via the following code.
## Atmospheric CO2 measured on Mauna Loa, Hawai'i
CO2 <- read.csv("https://raw.githubusercontent.com/SOE592/website/main/lectures/day_01/data/ML_CO2.csv")
## Northern hemisphere temperature anomolies
NH_temp <- read.csv("https://raw.githubusercontent.com/SOE592/website/main/lectures/day_01/data/NH_temp.csv")
Time series plots are an excellent way to begin the process of understanding what sort of process might have generated the data of interest. Traditionally, time series have been plotted with the observed data on the \(y\)-axis and time on the \(x\)-axis. Sequential time points are usually connected with some form of line, but sometimes other plot forms can be a useful way of conveying important information in the time series (e.g., barplots of sea-surface temperature anomolies show nicely the contrasting El Niño and La Niña phenomena).
ts objects and plot.ts()The CO2 data are stored in R as a data.frame
object, but we would like to transform the class to a more user-friendly
format for dealing with time series. Fortunately, the ts()
function will do just that, and return an object of class
ts as well.
Tip: Type ?ts at the command prompt to
see all of function arguments.
In addition to the data themselves, we need to provide
ts() with two pieces of information about the time index
for the data:
frequency is a bit of a misnomer because it does not
really refer to the number of cycles per unit time, but rather the
number of observations/samples per cycle. So, for example, if the data
were collected every hour of a day then
frequency = 24.
start specifies the starting time or date for the
first data point in terms of c(unit, subunit).
So, for example, if the data were collected monthly beginning in
November of 1999, then frequency = 12 and
start = c(1999, 11). If the data were collected annually,
then you simply specify start as a scalar (e.g.,
start = 1991) and omit frequency
(i.e., R will set frequency = 1 by default).
The Mauna Loa time series is collected monthly and begins in March of
1958, which we can get from the data themselves, and then pass to
ts().
## create a time series (ts) object from the CO2 data
co2 <- ts(data = CO2$ppm, frequency = 12,
start = c(CO2[1, "year"], CO2[1, "month"]))
Now let’s plot the data using plot.ts(), which is
designed specifically for ts objects like the one we
just created above. It’s nice because we don’t need to specify any \(x\)-values as they are taken directly from
the ts object.
## plot the ts
plot.ts(co2, ylab = expression(paste("CO"[2], " (ppm)")))
Time series of the atmospheric CO2 concentration at Mauna Loa, Hawai’i measured monthly from March 1958 to present.
Examination of the plotted time series shows two obvious features that would violate any assumption of stationarity:
an increasing (and perhaps non-linear) trend over time, and
strong seasonal patterns.
(Aside: Do you know the causes of these 2 phenomena?)
Before we examine the CO2 data further, however, let’s see
a quick example of how you can combine and plot multiple time series
together. We’ll use the data on monthly mean temperature anomolies for
the Northern Hemisphere (NH_temp).
Task: Begin by converting NH_temp to a
ts object.
## convert temperature data to ts object
temp_ts <- ts(data = NH_temp$Value, frequency = 12,
start = c(1880, 1))
Before we can plot the two time series together, however, we need to
line up their time indices because the temperature data start in January
of 1880, but the CO2 data start in March of 1958.
Fortunately, the ts.intersect() function makes this really
easy once the data have been transformed to ts objects
by trimming the data to a common time frame. Also,
ts.union() works in a similar fashion, but it pads one or
both series with the appropriate number of NA.
Task: Compare the results of
ts.intersect() and ts.union().
## intersection (only overlapping times)
dat_int <- ts.intersect(co2, temp_ts)
## dimensions of common-time data
dim(dat_int)
## [1] 682 2
## union (all times)
dat_unn <- ts.union(co2, temp_ts)
## dimensions of all-time data
dim(dat_unn)
## [1] 1647 2
As you can see, the intersection of the two data sets is much smaller
than the union. If you compare them, you will see that the first 965
rows of dat_unn contains NA in the
co2 column.
It turns out that the regular plot() function in R is
smart enough to recognize a ts object and use the
information contained therein appropriately. Here’s how to plot the
intersection of the two time series together with the y-axes on
alternate sides.
## plot the ts
# plot(dat_int, main = "", yax.flip = TRUE)
plot.ts(dat_int, main = "", yax.flip = TRUE)
Time series of the atmospheric CO2 concentration at Mauna Loa, Hawai’i (top) and the mean temperature index for the Northern Hemisphere (bottom) measured monthly from March 1958 to present.
Plotting time series data is an important first step in analyzing their various components. Beyond that, however, we need a more formal means for identifying and removing characteristics such as a trend or seasonal variation. As discussed in lecture, the decomposition model reduces a time series into 3 components: trend, seasonal effects, and random errors. In turn, we aim to model the random errors as some form of stationary process.
Let’s begin with a simple, additive decomposition model for a time series \(x_t\)
\[\begin{equation} (\#eq:classDecomp) x_t = m_t + s_t + e_t, \end{equation}\]
where, at time \(t\), \(m_t\) is the trend, \(s_t\) is the seasonal effect, and \(e_t\) is a random error that we generally assume to have zero-mean and to be correlated over time. Thus, by estimating and subtracting both \(\{m_t\}\) and \(\{s_t\}\) from \(\{x_t\}\), we hope to have a time series of stationary residuals \(\{e_t\}\).
In lecture we discussed how linear filters are a common way to estimate trends in time series. One of the most common linear filters is the moving average, which for time lags from \(-a\) to \(a\) is defined as
\[\begin{equation} (\#eq:linearFilter) \hat{m}_t = \sum_{k=-a}^{a} \left(\frac{1}{1+2a}\right) x_{t+k}. \end{equation}\]
This model works well for moving windows of odd-numbered lengths, but should be adjusted for even-numbered lengths by adding only \(\frac{1}{2}\) of the 2 most extreme lags so that the filtered value at time \(t\) lines up with the original observation at time \(t\). So, for example, in a case with monthly data such as the atmospheric CO2 concentration where a 12-point moving average would be an obvious choice, the linear filter would be
\[\begin{equation} (\#eq:linearFilterEx) \hat{m}_t = \frac{\frac{1}{2}x_{t-6} + x_{t-5} + \dots + x_{t-1} + x_t + x_{t+1} + \dots + x_{t+5} + \frac{1}{2}x_{t+6}}{12} \end{equation}\]
It is important to note here that our time series of the estimated trend \(\{\hat{m}_t\}\) is actually shorter than the observed time series by \(2a\) units.
Conveniently, R has the built-in function filter() in
the stats package for estimating moving-average (and
other) linear filters. In addition to specifying the time series to be
filtered, we need to pass in the filter weights (and 2 other arguments
we won’t worry about here–type ?filter to get more
information). The easiest way to create the filter is with the
rep() function:
## weights for moving average
fltr <- c(1 / 2, rep(1, times = 11), 1 / 2) / 12
Now let’s get our estimate of the trend \(\{\hat{m}\}\) with filter()}
and plot it:
## estimate of trend
co2_trend <- stats::filter(co2, filter = fltr, method = "convo", sides = 2)
## plot the trend
plot.ts(co2_trend, ylab = "Trend", cex = 1)
The trend is a more-or-less smoothly increasing function over time, the average slope of which does indeed appear to be increasing over time as well (Figure @ref(fig:tslab-plotTrendTSb)).
(ref:tslab-plotTrendTSb) Time series of the estimated trend \(\{\hat{m}_t\}\) for the atmospheric CO2 concentration at Mauna Loa, Hawai’i.
(ref:tslab-plotTrendTSb)
Once we have an estimate of the trend for time \(t\) (\(\hat{m}_t\)) we can easily obtain an estimate of the seasonal effect at time \(t\) (\(\hat{s}_t\)) by subtraction
\[\begin{equation} (\#eq:seasEst) \hat{s}_t = x_t - \hat{m}_t, \end{equation}\]
which is really easy to do in R:
## seasonal effect over time
co2_seas <- co2 - co2_trend
This estimate of the seasonal effect for each time \(t\) also contains the random error \(e_t\), however, which can be seen by plotting the time series and careful comparison of Equations @ref(eq:classDecomp) and @ref(eq:seasEst).
## plot the monthly seasonal effects
plot.ts(co2_seas, ylab = "Seasonal effect", xlab = "Month", cex = 1)
(ref:tslab-plotSeasTSb) Time series of seasonal effects plus random errors for the atmospheric CO2 concentration at Mauna Loa, Hawai’i, measured monthly from March 1958 to present.
(ref:tslab-plotSeasTSb)
We can obtain the overall seasonal effect by averaging the estimates of \(\{\hat{s}_t\}\) for each month and repeating this sequence over all years.
## length of ts
ll <- length(co2_seas)
## frequency (ie, 12)
ff <- frequency(co2_seas)
## number of periods (years); %/% is integer division
periods <- ll %/% ff
## index of cumulative month
index <- seq(1, ll, by = ff) - 1
## get mean by month
mm <- numeric(ff)
for (i in 1:ff) {
mm[i] <- mean(co2_seas[index + i], na.rm = TRUE)
}
## subtract mean to make overall mean = 0
mm <- mm - mean(mm)
Before we create the entire time series of seasonal effects, let’s plot them for each month to see what is happening within a year:
## plot the monthly seasonal effects
plot.ts(mm, ylab = "Seasonal effect", xlab = "Month", cex = 1)
It looks like, on average, that the CO2 concentration is highest in spring (March) and lowest in summer (August) (Figure @ref(fig:tslab-plotSeasMean)). (Aside: Do you know why this is?)
(ref:tslab-plotSeasMean) Estimated monthly seasonal effects for the atmospheric CO2 concentration at Mauna Loa, Hawai’i.
(ref:tslab-plotSeasMean)
Finally, let’s create the entire time series of seasonal effects \(\{\hat{s}_t\}\):
## create ts object for season
co2_seas_ts <- ts(rep(mm, periods + 1)[seq(ll)],
start = start(co2_seas),
frequency = ff
)
The last step in completing our full decomposition model is obtaining the random errors \(\{\hat{e}_t\}\), which we can get via simple subtraction
\[\begin{equation} (\#eq:errorEst) \hat{e}_t = x_t - \hat{m}_t - \hat{s}_t. \end{equation}\]
Again, this is really easy in R:
## random errors over time
co2_err <- co2 - co2_trend - co2_seas_ts
Now that we have all 3 of our model components, let’s plot them together with the observed data \(\{x_t\}\). The results are shown in Figure @ref(fig:tslab-plotTrSeas).
## plot the obs ts, trend & seasonal effect
plot(cbind(co2, co2_trend, co2_seas_ts, co2_err), main = "", yax.flip = TRUE)
(ref:tslab-plotTrSeas) Time series of the observed atmospheric CO2 concentration at Mauna Loa, Hawai’i (top) along with the estimated trend, seasonal effects, and random errors.
(ref:tslab-plotTrSeas)
decompose() for decompositionNow that we have seen how to estimate and plot the various components
of a classical decomposition model in a piecewise manner, let’s see how
to do this in one step in R with the function decompose(),
which accepts a ts object as input and returns an
object of class decomposed.ts.
## decomposition of CO2 data
co2_decomp <- decompose(co2)
co2_decomp is a list with the following elements, which
should be familiar by now:
x: the observed time series \(\{x_t\}\)seasonal: time series of estimated seasonal component
\(\{\hat{s}_t\}\)figure: mean seasonal effect
(length(figure) == frequency(x))trend: time series of estimated trend \(\{\hat{m}_t\}\)random: time series of random errors \(\{\hat{e}_t\}\)type: type of error ("additive" or
"multiplicative")We can easily make plots of the output and compare them to those in Figure @ref(fig:tslab-plotTrSeas):
## plot the obs ts, trend & seasonal effect
plot(co2_decomp, yax.flip = TRUE)
(ref:tslab-plotDecompB) Time series of the observed atmospheric
CO2 concentration at Mauna Loa, Hawai’i (top) along with the
estimated trend, seasonal effects, and random errors obtained with the
function decompose().
(ref:tslab-plotDecompB)
The results obtained with decompose() (Figure
@ref(fig:tslab-plotDecompB)) are identical to those we estimated
previously.
Another nice feature of the decompose() function is that
it can be used for decomposition models with multiplicative
(i.e., non-additive) errors (e.g., if the original
time series had a seasonal amplitude that increased with time). To do,
so pass in the argument type = "multiplicative", which is
set to type = "additive" by default.
An alternative to decomposition for removing trends is differencing. We saw in lecture how the difference operator works and how it can be used to remove linear and nonlinear trends as well as various seasonal features that might be evident in the data. As a reminder, we define the difference operator as
\[\begin{equation} (\#eq:diffDefnA) \nabla x_t = x_t - x_{t-1}, \end{equation}\]
and, more generally, for order \(d\)
\[\begin{equation} (\#eq:diffDefnB) \nabla^d x_t = (1-\BB)^d x_t, \end{equation}\] where B is the backshift operator (i.e., \(\BB^k x_t = x_{t-k}\) for \(k \geq 1\)).
So, for example, a random walk is one of the most simple and widely used time series models, but it is not stationary. We can write a random walk model as
\[\begin{equation} (\#eq:defnRW) x_t = x_{t-1} + w_t, \text{ with } w_t \sim \text{N}(0,q). \end{equation}\]
Applying the difference operator to Equation @ref(eq:defnRW) will yield a time series of Gaussian white noise errors \(\{w_t\}\):
\[\begin{equation} (\#eq:diffRW) \begin{aligned} \nabla (x_t &= x_{t-1} + w_t) \\ x_t - x_{t-1} &= x_{t-1} - x_{t-1} + w_t \\ x_t - x_{t-1} &= w_t \end{aligned} \end{equation}\]
diff() functionIn R we can use the diff() function for differencing a
time series, which requires 3 arguments: x (the data),
lag (the lag at which to difference), and
differences (the order of differencing; \(d\) in Equation @ref(eq:diffDefnB)). For
example, first-differencing a time series will remove a linear trend
(i.e., differences = 1); twice-differencing will
remove a quadratic trend (i.e., differences = 2).
In addition, first-differencing a time series at a lag equal to the
period will remove a seasonal trend (e.g., set
lag = 12 for monthly data).
Let’s use diff() to remove the trend and seasonal signal
from the CO2 time series, beginning with the trend. Close
inspection of Figure @ref(fig:tslab-plotdata1) would suggest that there
is a nonlinear increase in CO2 concentration over time, so
we’ll set differences = 2):
## twice-difference the CO2 data
co2_d2 <- diff(co2, differences = 2)
## plot the differenced data
plot(co2_d2, ylab = expression(paste(nabla^2, "CO"[2])))
(ref:tslab-plotCO2diff2) Time series of the twice-differenced atmospheric CO2 concentration at Mauna Loa, Hawai’i.
(ref:tslab-plotCO2diff2)
We were apparently successful in removing the trend, but the seasonal effect still appears obvious (Figure @ref(fig:tslab-plotCO2diff2)). Therefore, let’s go ahead and difference that series at lag-12 because our data were collected monthly.
## difference the differenced CO2 data
co2_d2d12 <- diff(co2_d2, lag = 12)
## plot the newly differenced data
plot(co2_d2d12,
ylab = expression(paste(nabla, "(", nabla^2, "CO"[2], ")")))
(ref:tslab-plotCO2diff12) Time series of the lag-12 difference of the twice-differenced atmospheric CO2 concentration at Mauna Loa, Hawai’i.
(ref:tslab-plotCO2diff12)
Now we have a time series that appears to be random errors without any obvious trend or seasonal components (Figure @ref(fig:tslab-plotCO2diff12)).
The concepts of covariance and correlation are very important in time series analysis. In particular, we can examine the correlation structure of the original data or random errors from a decomposition model to help us identify possible form(s) of (non)stationary model(s) for the stochastic process.
Autocorrelation is the correlation of a variable with itself at differing time lags. Recall from lecture that we defined the sample autocovariance function (ACVF), \(c_k\), for some lag \(k\) as
\[\begin{equation} (\#eq:ACVF) c_k = \frac{1}{n}\sum_{t=1}^{n-k} \left(x_t-\bar{x}\right) \left(x_{t+k}-\bar{x}\right) \end{equation}\]
Note that the sample autocovariance of \(\{x_t\}\) at lag 0, \(c_0\), equals the sample variance of \(\{x_t\}\) calculated with a denominator of \(n\). The sample autocorrelation function (ACF) is defined as
\[\begin{equation} (\#eq:ACF) r_k = \frac{c_k}{c_0} = \text{Cor}(x_t,x_{t+k}) \end{equation}\]
Recall also that an approximate 95% confidence interval on the ACF can be estimated by
\[\begin{equation} (\#eq:ACF95CI) -\frac{1}{n} \pm \frac{2}{\sqrt{n}} \end{equation}\]
where \(n\) is the number of data points used in the calculation of the ACF.
It is important to remember two things here. First, although the confidence interval is commonly plotted and interpreted as a horizontal line over all time lags, the interval itself actually grows as the lag increases because the number of data points \(n\) used to estimate the correlation decreases by 1 for every integer increase in lag. Second, care must be exercised when interpreting the “significance” of the correlation at various lags because we should expect, a priori, that approximately 1 out of every 20 correlations will be significant based on chance alone.
We can use the acf() function in R to compute the sample
ACF (note that adding the option type = "covariance" will
return the sample auto-covariance (ACVF) instead of the ACF–type
?acf for details). Calling the function by itself will will
automatically produce a correlogram (i.e., a plot of the
autocorrelation versus time lag). The argument lag.max
allows you to set the number of positive and negative lags. Let’s try it
for the CO2 data.
## correlogram of the CO2 data
acf(co2, lag.max = 36)
(ref:tslab-plotACFb) Correlogram of the observed atmospheric
CO2 concentration at Mauna Loa, Hawai’i obtained with the
function acf().
(ref:tslab-plotACFb)
There are 4 things about Figure @ref(fig:tslab-plotACFb) that are noteworthy:
As an alternative to the default plots for acf objects, let’s define a new plot function for acf objects with some better features:
## better ACF plot
plot_acf <- function(ACFobj) {
rr <- ACFobj$acf[-1]
kk <- length(rr)
nn <- ACFobj$n.used
plot(seq(kk), rr,
type = "h", lwd = 2, yaxs = "i", xaxs = "i",
ylim = c(floor(min(rr)), 1), xlim = c(0, kk + 1),
xlab = "Lag", ylab = "Correlation", las = 1
)
abline(h = -1 / nn + c(-2, 2) / sqrt(nn), lty = "dashed", col = "blue")
abline(h = 0)
}
Now we can assign the result of acf() to a variable and
then use the information contained therein to plot the correlogram with
our new plot function.
## acf of the CO2 data
co2_acf <- acf(co2, lag.max = 36)
## correlogram of the CO2 data
plot_acf(co2_acf)
(ref:tslab-plotbetterACF) Correlogram of the observed atmospheric
CO2 concentration at Mauna Loa, Hawai’i obtained with the
function plot_acf().
(ref:tslab-plotbetterACF)
Notice that all of the relevant information is still there (Figure @ref(fig:tslab-plotbetterACF)), but now \(r_0=1\) is not plotted at lag-0 and the lags on the \(x\)-axis are displayed correctly as integers.
Before we move on to the PACF, let’s look at the ACF for some deterministic time series, which will help you identify interesting properties (e.g., trends, seasonal effects) in a stochastic time series, and account for them in time series models–an important topic in this course. First, let’s look at a straight line.
## length of ts
nn <- 100
## create straight line
tt <- seq(nn)
## set up plot area
par(mfrow = c(1, 2))
## plot line
plot.ts(tt, ylab = expression(italic(x[t])))
## get ACF
line.acf <- acf(tt, plot = FALSE)
## plot ACF
plot_acf(line.acf)
(ref:tslab-plotLinearACF) Time series plot of a straight line (left) and the correlogram of its ACF (right).
(ref:tslab-plotLinearACF)
The correlogram for a straight line is itself a linearly decreasing function over time (Figure @ref(fig:tslab-plotLinearACF)).
Now let’s examine the ACF for a sine wave and see what sort of pattern arises.
## create sine wave
tt <- sin(2 * pi * seq(nn) / 12)
## set up plot area
par(mfrow = c(1, 2))
## plot line
plot.ts(tt, ylab = expression(italic(x[t])))
## get ACF
sine_acf <- acf(tt, plot = FALSE)
## plot ACF
plot_acf(sine_acf)
(ref:tslab-plotSineACF) Time series plot of a discrete sine wave (left) and the correlogram of its ACF (right).
(ref:tslab-plotSineACF)
Perhaps not surprisingly, the correlogram for a sine wave is itself a sine wave whose amplitude decreases linearly over time (Figure @ref(fig:tslab-plotSineACF)).
Now let’s examine the ACF for a sine wave with a linear downward trend and see what sort of patterns arise.
## create sine wave with trend
tt <- sin(2 * pi * seq(nn) / 12) - seq(nn) / 50
## set up plot area
par(mfrow = c(1, 2))
## plot line
plot.ts(tt, ylab = expression(italic(x[t])))
## get ACF
sili_acf <- acf(tt, plot = FALSE)
## plot ACF
plot_acf(sili_acf)
(ref:tslab-plotSiLiACF) Time series plot of a discrete sine wave (left) and the correlogram of its ACF (right).
(ref:tslab-plotSiLiACF)
The correlogram for a sine wave with a trend is itself a nonsymmetrical sine wave whose amplitude and center decrease over time (Figure @ref(fig:tslab-plotSiLiACF)).
As we have seen, the ACF is a powerful tool in time series analysis for identifying important features in the data. As we will see later, the ACF is also an important diagnostic tool for helping to select the proper order of \(p\) and \(q\) in ARMA(\(p\),\(q\)) models.
The partial autocorrelation function (PACF) measures the linear correlation of a series \(\{x_t\}\) and a lagged version of itself \(\{x_{t+k}\}\) with the linear dependence of \(\{x_{t-1},x_{t-2},\dots,x_{t-(k-1)}\}\) removed. Recall from lecture that we define the PACF as
\[\begin{equation} (\#eq:PACFdefn) f_k = \begin{cases} \text{Cor}(x_1,x_0)=r_1 & \text{if } k = 1;\\ \text{Cor}(x_k-x_k^{k-1},x_0-x_0^{k-1}) & \text{if } k \geq 2; \end{cases} \end{equation}\]
with
It’s easy to compute the PACF for a variable in R using the
pacf() function, which will automatically plot a
correlogram when called by itself (similar to acf()). Let’s
look at the PACF for the CO2 data.
## PACF of the CO2 data
pacf(co2, lag.max = 36)
The default plot for PACF is a bit better than for ACF, but here is another plotting function that might be useful.
## better PACF plot
plot_pacf <- function(PACFobj) {
rr <- PACFobj$acf
kk <- length(rr)
nn <- PACFobj$n.used
plot(seq(kk), rr,
type = "h", lwd = 2, yaxs = "i", xaxs = "i",
ylim = c(floor(min(rr)), 1), xlim = c(0, kk + 1),
xlab = "Lag", ylab = "PACF", las = 1
)
abline(h = -1 / nn + c(-2, 2) / sqrt(nn), lty = "dashed", col = "blue")
abline(h = 0)
}
(ref:tslab-plotPACFb) Correlogram of the PACF for the observed
atmospheric CO2 concentration at Mauna Loa, Hawai’i obtained
with the function pacf().
(ref:tslab-plotPACFb)
Notice in Figure @ref(fig:tslab-plotPACFb) that the partial autocorrelation at lag-1 is very high (it equals the ACF at lag-1), but the other values at lags > 1 are relatively small, unlike what we saw for the ACF. We will discuss this in more detail later on in this lab.
Notice also that the PACF plot again has real-valued indices for the
time lag, but it does not include any value for lag-0 because it is
impossible to remove any intermediate autocorrelation between \(t\) and \(t-k\) when \(k=0\), and therefore the PACF does not
exist at lag-0. If you would like, you can use the
plot_acf() function we defined above to plot the PACF
estimates because acf() and pacf() produce
identical list structures (results not shown here).
## PACF of the CO2 data
co2_pacf <- pacf(co2)
## correlogram of the CO2 data
plot_acf(co2_pacf)
As with the ACF, we will see later on how the PACF can also be used to help identify the appropriate order of \(p\) and \(q\) in ARMA(\(p\),\(q\)) models.
Often we are interested in looking for relationships between 2 different time series. There are many ways to do this, but a simple method is via examination of their cross-covariance and cross-correlation.
We begin by defining the sample cross-covariance function (CCVF) in a manner similar to the ACVF, in that
\[\begin{equation} (\#eq:CCVF) g_k^{xy} = \frac{1}{n}\sum_{t=1}^{n-k} \left(y_t-\bar{y}\right) \left(x_{t+k}-\bar{x}\right), \end{equation}\]
but now we are estimating the correlation between a variable \(y\) and a different time-shifted variable \(x_{t+k}\). The sample cross-correlation function (CCF) is then defined analogously to the ACF, such that
\[\begin{equation} (\#eq:CCF) r_k^{xy} = \frac{g_k^{xy}}{\sqrt{\text{SD}_x\text{SD}_y}}; \end{equation}\]
SD\(_x\) and SD\(_y\) are the sample standard deviations of \(\{x_t\}\) and \(\{y_t\}\), respectively. It is important to re-iterate here that \(r_k^{xy} \neq r_{-k}^{xy}\), but \(r_k^{xy} = r_{-k}^{yx}\). Therefore, it is very important to pay particular attention to which variable you call \(y\) (i.e., the “response”) and which you call \(x\) (i.e., the “predictor”).
As with the ACF, an approximate 95% confidence interval on the CCF can be estimated by
\[\begin{equation} (\#eq:CCF95CI) -\frac{1}{n} \pm \frac{2}{\sqrt{n}} \end{equation}\]
where \(n\) is the number of data points used in the calculation of the CCF, and the same assumptions apply to its interpretation.
Computing the CCF in R is easy with the function ccf()
and it works just like acf(). In fact, ccf()
is just a “wrapper” function that calls acf(). As an
example, let’s examine the CCF between sunspot activity and number of
lynx trapped in Canada as in the classic paper by Moran.
To begin, let’s get the data, which are conveniently included in the
datasets package included as part of the base
installation of R. Before calculating the CCF, however, we need to find
the matching years of data. Again, we’ll use the
ts.intersect() function.
## get the matching years of sunspot data
suns <- ts.intersect(lynx, sunspot.year)[, "sunspot.year"]
## get the matching lynx data
lynx <- ts.intersect(lynx, sunspot.year)[, "lynx"]
Here are plots of the time series.
## plot time series
plot(cbind(suns, lynx), yax.flip = TRUE)
(ref:tslab-plotSunsLynx) Time series of sunspot activity (top) and lynx trappings in Canada (bottom) from 1821-1934.
(ref:tslab-plotSunsLynx)
It is important to remember which of the 2 variables you call \(y\) and \(x\) when calling
ccf(x, y, ...). In this case, it seems most relevant to
treat lynx as the \(y\) and sunspots as
the \(x\), in which case we are mostly
interested in the CCF at negative lags (i.e., when sunspot
activity predates inferred lynx abundance). Furthermore, we’ll use
log-transformed lynx trappings.
## CCF of sunspots and lynx
ccf(suns, log(lynx), ylab = "Cross-correlation")
(ref:tslab-plotCCFb) CCF for annual sunspot activity and the log of the number of lynx trappings in Canada from 1821-1934.
(ref:tslab-plotCCFb)
From Figures @ref(fig:tslab-plotSunsLynx) and @ref(fig:tslab-plotCCFb) it looks like lynx numbers are relatively low 3-5 years after high sunspot activity (i.e., significant correlation at lags of -3 to -5).